The electronic MEdical Records & GEnomics (eMERGE) network was established in 2007 
by the National Human Genome Research Institute (NHGRI) of the National Institutes of 
Health (NIH) in part to explore the utility of electronic medical records (EMRs) in genome 
science. The initial focus was on discovery primarily using the genome-wide association 
paradigm, but more recently, the network has begun evaluating mechanisms to implement 
new genomic information coupled to clinical decision support into EMRs. Herein, we 
describe this evolution including the development of the individual and merged eMERGE 
genomic datasets, the contribution the network has made toward genomic discovery 
and human health, and the steps taken toward the next generation genotype-phenotype 
association studies and clinical implementation. 

INTRODUCTION 

Revolutions in genotyping technology (Ragoussis, 2009) and computational power, coupled with the creation of public scientific resources such as The Human Genome Project (2001; Venter et al., 2001), The International HapMap Project (2003; The International HapMap Consortium, 2005), and most recently the 1000 Genomes Project (2012), have accelerated genomic discovery, most commonly through genome-wide association studies (GWAS). As of late March 2014, the National Human Genome Research Institute (NHGRI) GWAS catalog listed 1201 publications with 3961 SNPs associated with approximately 571 human diseases and traits at a significance threshold of 5.0 x 10^-8 (Welter et al., 2014) (https://www.genome.gov/26525384).

The majority of genomic discoveries published to date have been from case-control or cohort epidemiologic studies that collected specific health-related data and DNA samples. These traditional epidemiologic collections already exist and are primed for genomic discovery studies (Willett et al., 2007), making them ideal for large-scale GWAS. Also, although currently under-utilized in genomic discovery, many of the cohorts have collected exposure data that can be interrogated for gene-environment interaction studies (Manolio et al., 2006; Thomas, 2010). However, a major disadvantage of accessing existing epidemiologic cohorts for genomic discoveries is limited representation of diverse racial/ethnic groups (Rosenberg et al., 2010) and of children (Collins and Manolio, 2007). Also, the existing health-related data can be limiting, especially for cohorts or case-control collections designed with very specific disease outcomes for study such as cancers or cardiovascular disease. Finally, establishing and maintaining an on-going cohort study can pose a significant cost burden (Rukovets, 2013).

The disadvantages of accessing existing case-control and cohort studies, coupled with the continued need for genotype-phenotype data for genomic discoveries, led to the consideration of alternative study designs and data sources such as biorepositories linked to electronic medical records (EMRs). In addition to the potential for large sample sizes of diverse groups, biobanks linked to EMRs make possible the study of many different outcomes and traits, many of which may not be routinely collected by traditional epidemiologic cohorts. And, in this burgeoning era of precision or personalized medicine, biobanks in clinical settings offer unprecedented opportunities to quickly translate research findings to improvements in patient care.

In recognition of the potential of EMR-linked biobanks for genomic discovery and personalized medicine, NHGRI established the electronic MEdical Records & GEnomics (eMERGE) network. The eMERGE network began in 2007 with a Coordinating Center (Vanderbilt University) and five study sites: Group Health/University of Washington, Marshfield Clinic, Mayo Clinic, Northwestern University, and Vanderbilt University (McCarty et al., 2011). The network expanded to include new adult study sites (The Icahn School of Medicine at Mount Sinai and Geisinger Health System) in 2011 as well as pediatric study sites in 2012 (Children's Hospital of Philadelphia and Boston Children's Hospital/Cincinnati Children's Hospital Medical Center) (Gottesman et al., 2013). The major goals of eMERGE I (McCarty et al., 2011) have evolved with experience, and the major activities of the Genomics Work Group of the eMERGE II network are outlined in Figure 1. Here we review, from the perspective of the eMERGE Genomics Work Group, the contributions the network has made toward genomic discovery since 2007. We also foreshadow the eMERGE network's contributions to the second generation of genotype-phenotype associations as well as implementation of genomic medicine.

eMERGE GENOMIC RESOURCES 

The first few years of the eMERGE network required data generation at both the phenotype and genotype levels (McCarty et al., 2011; Gottesman et al., 2013). In the first phase of the eMERGE network, each study site proposed an outcome or trait for phenotype algorithm development and selection of DNA samples for genotyping. Since EMR data are generated for the purposes of clinical care, a necessary step in identifying populations of interest was to create and validate algorithms that queried data elements from the EMR to find phenotypes of interest (Kho et al., 2011; Newton et al., 2013). Typically, these algorithms involved Boolean combinations of billing codes, medication exposures, laboratory and test results, and/or natural language processing. All algorithms and their validation results in the eMERGE network are available on PheKB (www.phekb.org).
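
As a rough illustration of what a "Boolean combination" of EMR data elements can look like, the sketch below classifies a toy patient record as a case, control, or excluded. It is not an eMERGE or PheKB algorithm: the record layout, the code and medication lists, and the laboratory cutoff are all hypothetical choices made for this example, and natural language processing is omitted.

```python
from dataclasses import dataclass, field


@dataclass
class PatientRecord:
    """Toy EMR extract; every field name here is hypothetical."""
    billing_codes: set = field(default_factory=set)
    medications: set = field(default_factory=set)
    labs: dict = field(default_factory=dict)  # lab name -> most recent value


# Illustrative lists only; a validated algorithm would be far more detailed.
T2D_CODES = {"250.00", "250.02"}
T1D_CODES = {"250.01", "250.03"}
T2D_MEDS = {"metformin", "glipizide"}
HBA1C_CUTOFF = 6.5  # percent; assumed threshold for this sketch


def classify(rec: PatientRecord) -> str:
    """Return 'case', 'control', or 'excluded' from Boolean rules."""
    has_t2d_code = bool(rec.billing_codes & T2D_CODES)
    has_t1d_code = bool(rec.billing_codes & T1D_CODES)
    on_t2d_med = bool(rec.medications & T2D_MEDS)
    high_a1c = rec.labs.get("hba1c", 0.0) >= HBA1C_CUTOFF

    # Case: a diagnosis code AND supporting evidence, with no competing diagnosis.
    if has_t2d_code and not has_t1d_code and (on_t2d_med or high_a1c):
        return "case"
    # Control: no evidence of any kind.
    if not (has_t2d_code or has_t1d_code or on_t2d_med or high_a1c):
        return "control"
    # Everything else is ambiguous and left out of the analysis.
    return "excluded"
```

Requiring two independent kinds of evidence for a case, and none at all for a control, trades sample size for accuracy; the ambiguous middle group is simply dropped.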

After validation of phenotype algorithms by blinded review, typically by physicians, matching case and control samples were genotyped. All DNA samples were genotyped using either the Illumina 660-Quad (primarily for participants of European ancestry) or the Illumina 1M (primarily for participants of African ancestry) at either the Broad Institute Center for Genotyping and Analysis or the Center for Inherited Disease Research (CIDR). The eMERGE Coordinating Center established a pipeline to process each study site's data for quality control, data cleaning, and eventual Database of Genotypes and Phenotypes (dbGaP) (Mailman et al., 2007) documentation and deposition (Turner et al., 2011a). The initial round of phenotyping and genotyping resulted in the generation of GWAS-level data on 19,637 samples, of which 18,663 passed quality control metrics. The phenotypes and sample sizes available from these eMERGE phase I efforts included cataracts/HDL-C (2642 cases and 1322 controls; led by Marshfield Clinic), dementia (1241 cases and 2043 controls; led by Group Health Cooperative/University of Washington), electrocardiographic traits (3034 individuals; led by Vanderbilt University), peripheral artery disease (1641 cases and 1604 controls; led by Mayo Clinic), and type 2 diabetes (2706 cases and 1496 controls; led by Northwestern University).

During phase I of the eMERGE network, high-density genotyping had matured such that many large cohorts and biorepositories linked to EMRs had existing GWAS-level data. This included expanded genotype datasets at some eMERGE I sites and, as such, no new high-density genome-wide genotyping was performed in eMERGE phase II. All existing and new study sites in eMERGE II offered existing data on a variety of genotyping platforms and genetic ancestries. With the inclusion of the eMERGE phase I data, a total of 60,766 (47,507 adult and 13,259 pediatric) samples with GWAS-level genotypes or other large-scale data [such as Metabochip (Voight et al., 2012)] generated by either Illumina or Affymetrix arrays are available for study in eMERGE phase II. As detailed in a separate manuscript (Verma et al., in press), pooling and merging of these data required imputation and extensive quality control. The current eMERGE phase II merged dataset (version 2) available for analysis includes 51,038 samples linked to EMRs imputed to >36 million SNPs using the 1000 Genomes Project cosmopolitan reference panel (n = 1092) and IMPUTE2 (Verma et al., in press).

New to eMERGE phase II is the eMERGE-PGx project, which involves the targeted sequencing of 84 pharmacogenes identified by the Pharmacogenomics Research Network (PGRN) using DNA capture and contemporary sequencing technologies (known as PGRN-Seq) (Rasmussen-Torvik et al., in press). For this effort, each eMERGE II study site is enrolling ~1000 patients as a pilot study of pharmacogenetic sequencing in clinical practice. Enrollment and sequencing are on-going, and the anticipated network-wide sample size is 9000. All variants annotated through this effort will be available in summary data form via the eMERGE on-line resource "Sequence, Phenotype, and pHarmacogenomics integration eXchange" or "SPHINX" (www.emergesphinx.org). The eMERGE-PGx project will help establish best practices for implementing personalized medicine, including exploring and establishing guidelines for returning results to physicians and patients (Kullo et al., 2014). These data will also contribute toward the catalog of rare and less common variants and couple them to EMR data, which may increase their clinical utility.

eMERGE GENOMIC DISCOVERIES 

It was recognized early in the phenotype and genotype data generation phase of eMERGE I that large sample sizes are needed to have sufficient statistical power for genetic association studies. Indeed, initial GWAS of single eMERGE study site datasets demonstrated that known genotype-phenotype associations such as SCN10A and PR duration (Chambers et al., 2010; Holm et al., 2010; Pfeufer et al., 2010) could be replicated, albeit with p-values above the genome-wide significance threshold of 5.0 x 10^-8 (Denny et al., 2010b). While this exercise of replication demonstrated that EMR-derived phenotypes could be used in genotype-phenotype studies, genomic discovery of new associations would require larger sample sizes.

To achieve this goal, the eMERGE network employed several strategies, including (1) pooled analysis across the network, (2) meta-analysis within and with outside consortia, and (3) generation of new phenotype and genotype data for new studies. In the first strategy, each eMERGE study site deployed not only the phenotype algorithm used to select study subjects for the genotype-phenotype association studies of the site's primary phenotype, but also the phenotype algorithms designed by other sites to identify additional cases and controls with existing GWAS-level genotyping for these secondary phenotypes. This strategy was successful and identified >15,000 additional samples with existing GWAS-level data to be repurposed for other phenotypes. This effort to share and deploy phenotype algorithms across sites enabled network-wide genomic discoveries for a variety of quantitative traits (Table 1) and facilitated data sharing for meta-analysis efforts outside of the eMERGE network for complex diseases such as late onset Alzheimer's disease (Naj et al., 2011) and electrocardiographic traits (Jeff et al., in press).

FIGURE 1 | Major activities of the Genomics Work Group of the eMERGE network. Abbreviations: CHOP, Children's Hospital of Philadelphia; CCHMC, Cincinnati Children's Hospital Medical Center; BCH, Boston Children's Hospital; GHC, Group Health Cooperative; UW, University of Washington; PSU, Pennsylvania State University; QC, quality control; EMR, electronic medical record; PheWAS, phenome-wide association study; EWAS, environment-wide association study; CNV, copy number variation; PGx, pharmacogenomics.

Implicit in the eMERGE data sharing strategy is the concept that phenotype algorithms are portable across different study sites with different EMR software systems as well as different health care practices and cultures (Kho et al., 2011). Also, it was assumed that each study site could reuse data collected for a specific phenotype or trait to conduct studies for other unrelated phenotypes without introducing substantial biases. For example, in the type 2 diabetes (T2D) association study, there was considerable heterogeneity in the proportion of T2D cases at each site, as well as in the odds ratio estimates for the index T2D SNP within each site's cohort, but when combined across the sites the odds ratio was indistinguishable from those using larger purposely-collected T2D case-control collections (Kho et al., 2012). These data suggest that potential study heterogeneity was magnified or measurable at the single study level but dampened at the larger network-wide level of analysis.
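
One way to see how site-level heterogeneity can be dampened when results are combined is a fixed-effect, inverse-variance meta-analysis of per-site odds ratios. The sketch below is a generic textbook calculation, not the specific procedure of Kho et al. (2012); the function name and the per-site numbers in the usage note are made up for illustration.

```python
import math

Z_95 = 1.959964  # standard normal quantile for a two-sided 95% interval


def pooled_odds_ratio(site_estimates):
    """Combine per-site odds ratios with inverse-variance weights.

    site_estimates: list of (odds_ratio, ci_low, ci_high), 95% CIs assumed.
    Returns (pooled_or, pooled_ci_low, pooled_ci_high).
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for odds_ratio, ci_low, ci_high in site_estimates:
        beta = math.log(odds_ratio)
        # Recover the standard error of log(OR) from the CI width.
        se = (math.log(ci_high) - math.log(ci_low)) / (2 * Z_95)
        weight = 1.0 / se ** 2
        weighted_sum += weight * beta
        weight_total += weight
    beta_pooled = weighted_sum / weight_total
    se_pooled = math.sqrt(1.0 / weight_total)
    return (
        math.exp(beta_pooled),
        math.exp(beta_pooled - Z_95 * se_pooled),
        math.exp(beta_pooled + Z_95 * se_pooled),
    )
```

For example, calling `pooled_odds_ratio([(1.2, 0.9, 1.6), (1.6, 1.2, 2.13), (1.4, 1.1, 1.78)])` with these hypothetical site estimates yields a pooled odds ratio between the site extremes and a confidence interval narrower than that of any single site.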

To further test the boundaries of these assumptions and early observations, eMERGE undertook a network-wide study of hypothyroidism, a new phenotype not related to any of the study site-specific phenotypes. The phenotype algorithm was developed at the Vanderbilt University study site and deployed and evaluated by all eMERGE study sites, like other eMERGE phenotypes.

Table 1 | eMERGE and genomic discovery.

| Phenotype | Nearest gene (rs number) | Genetic effect size | P | Study design (Population) | Sample size | References |
|---|---|---|---|---|---|---|
| Alzheimer's Disease, late onset | BIN1 (rs7561528) | OR = 1.17 (95% CI: 1.13, 1.22) | 4.2 x 10^-14 | Consortium meta-analysis, replication (EA) | 8309 cases, 7366 controls | Naj et al., 2011 |
| | CD2AP (rs9349407) | OR = 1.11 (95% CI: 1.07, 1.15) | 8.6 x 10^-9 | Consortium meta-analysis, discovery + replication (EA) | 18,762 cases, 29,827 controls | |
| | CD33 (rs3865444) | OR = 0.91 (95% CI: 0.88, 0.93) | 1.6 x 10^-9 | Consortium meta-analysis, discovery + replication (EA) | 18,762 cases, 29,827 controls | |
| | CLU (rs1532278) | OR = 0.89 (95% CI: 0.85, 0.93) | 1.9 x 10^-8 | Consortium joint-analysis, replication (EA) | 8309 cases, 7366 controls | |
| | CR1 (rs6701713) | OR = 1.16 (95% CI: 1.11, 1.22) | 4.6 x 10^-10 | Consortium meta-analysis, replication (EA) | 8309 cases, 7366 controls | |
| | EPHA1 (rs11767557) | OR = 0.90 (95% CI: 0.86, 0.93) | 6.0 x 10^-10 | Consortium meta-analysis, discovery + replication (EA) | 18,762 cases, 35,597 controls | |
| | MS4A4A (rs4938933) | OR = 0.88 (95% CI: 0.85, 0.92) | 1.7 x 10^-9 | Consortium meta-analysis, discovery + replication (EA) | 8309 cases, 7366 controls | |
| | PICALM (rs561655) | OR = 0.87 (95% CI: 0.84, 0.91) | 7.0 x 10^-11 | Consortium meta-analysis, replication (EA) | 8309 cases, 7366 controls | |
| Erythrocyte sedimentation rate | C1orf63 (rs1043879) | β = -0.09 | 2 x 10^-9 | eMERGE joint analysis, discovery + replication (EA) | 7607 individuals | Kullo et al., 2011 |
| | CR1 (rs650877) | β = -0.18 | 3 x 10^-26 | eMERGE joint analysis, discovery + replication (EA) | 7607 individuals | |
| | CR1L (rs7527798) | β = 0.10 | 2 x 10^-9 | eMERGE joint analysis, discovery + replication (EA) | 7607 individuals | |
| | TMEM50A (rs25547372) | β = -0.10 | 2 x 10^-13 | eMERGE joint analysis, discovery + replication (EA) | 7607 individuals | |
| | TMEM57 (rs25631242) | β = -0.10 | 1 x 10^-12 | eMERGE joint analysis, discovery + replication (EA) | 7607 individuals | |
| | TMEM57 (rs25641524) | β = -0.10 | 5 x 10^-13 | eMERGE joint analysis, discovery + replication (EA) | 7607 individuals | |
| HDL-C | CETP (rs3764261) | β = 2.25 (SE = 0.21) | 1.22 x 10^-25 | eMERGE analysis, replication (EA) | 3740 individuals | Turner et al., 2011b |
| | LIPC (rs11855284) | β = 2.00 (SE = 0.26) | 3.92 x 10^-14 | eMERGE analysis, replication (EA) | 3740 individuals | |
| Hypothyroidism | FOXE1 (rs7850258) | OR = 0.74 (95% CI: 0.67, 0.82) | 3.96 x 10^-9 | eMERGE joint analysis, discovery (EA) | 1317 cases, 5053 controls | Denny et al., 2011 |
| LDL-C | APOE (rs7412) | β = -20.0 mg/dl (95% CI: -25.9, -14.1) | 6.3 x 10^-11 | eMERGE joint analysis, discovery (AA) | 618 individuals | Rasmussen-Torvik et al., 2012 |
| Monocyte count | CCBP2 (rs2228467) | β = 0.32 | 2.39 x 10^-8 | eMERGE joint analysis, discovery (EA) | 11,014 individuals | Crosslin et al., 2013 |
| | IRF8 (rs424971) | β = -0.25 | 6.32 x 10^-18 | eMERGE joint analysis, discovery (EA) | 11,014 individuals | |
| | ITGA4 (rs2124440) | β = -0.22 | 1.35 x 10^-14 | eMERGE joint analysis, replication (EA) | 11,014 individuals | |
| | RPN1 (rs2712381) | β = -0.22 | 4.52 x 10^-14 | eMERGE joint analysis, replication (EA) | 11,014 individuals | |
| PheWAS | EXOC2 (rs12210050) | OR = 1.32 (95% CI: 1.20, 1.45) | 1.9 x 10^-8 | eMERGE pooled analysis, discovery for actinic keratosis (EA) | 13,835 individuals | Denny et al., 2013 |
| | IRF4 (rs12203592) | OR = 1.69 (95% CI: 1.53, 1.86) | 4.1 x 10^-26 | eMERGE pooled analysis, discovery for actinic keratosis (EA) | 13,835 individuals | |
| | IRF4 (rs12203592) | OR = 1.50 (95% CI: 1.36, 1.64) | 3.8 x 10^-17 | eMERGE pooled analysis, discovery for non-melanoma skin cancer (EA) | 13,835 individuals | |
| | NME7 (rs16861990) | OR = 3.71 (95% CI: 2.57, 5.34) | 2.0 x 10^-12 | eMERGE pooled analysis, discovery for hypercoagulable state (EA) | 13,835 individuals | |
| | TYR (rs1847134) | OR = 1.28 (95% CI: 1.18, 1.38) | 2.6 x 10^-10 | eMERGE pooled analysis, discovery for non-melanoma skin cancer (EA) | 13,835 individuals | |
| Platelets | ARHGEF3 (rs1354034) | β = -0.19 | 9.0 x 10^-34 | eMERGE pooled analysis, discovery for mean platelet volume (EA) | 6291 individuals | Shameer et al., 2014 |
| | ARHGEF3 (rs1354034) | β = 7.97 | 6.0 x 10^-24 | eMERGE pooled analysis, discovery for platelet counts (EA) | 13,424 individuals | |
| | BET1L (rs11602954) | β = -6.46 | 5.0 x 10^-12 | eMERGE pooled analysis, discovery for platelet counts (EA) | 13,424 individuals | |
| | DNM3 (rs2180748) | β = 0.09 | 2.0 x 10^-8 | eMERGE pooled analysis, discovery for mean platelet volume (EA) | 6291 individuals | |
| | FLJ36031-PIK3CG (rs342240) | β = -0.15 | 5.0 x 10^-22 | eMERGE pooled analysis, discovery for mean platelet volume (EA) | 6291 individuals | |
| | HBS1L-MYB (rs4895441) | β = -5.42 | 9.0 x 10^-10 | eMERGE pooled analysis, discovery for platelet counts (EA) | 13,424 individuals | |
| | JMJD1C (rs4379723) | β = 0.13 | 3.0 x 10^-16 | eMERGE pooled analysis, discovery for mean platelet volume (EA) | 6291 individuals | |
| | NFE2 (rs10506328) | β = -0.09 | 2.0 x 10^-9 | eMERGE pooled analysis, discovery for mean platelet volume (EA) | 6291 individuals | |
| | RCL1 (rs423955) | β = 4.94 | 1.0 x 10^-9 | eMERGE pooled analysis, discovery for platelet counts (EA) | 13,424 individuals | |
| | SH2B3 (rs3184504) | β = -5.33 | 5.0 x 10^-11 | eMERGE pooled analysis, discovery for platelet counts (EA) | 13,424 individuals | |
| | TAOK1 (rs9900280) | β = 0.10 | 1.0 x 10^-10 | eMERGE pooled analysis, discovery for mean platelet volume (EA) | 6291 individuals | |
| | TMCC2 (rs9660992) | β = 0.11 | 3.0 x 10^-13 | eMERGE pooled analysis, discovery for mean platelet volume (EA) | 6291 individuals | |
| | WDR66 (rs7961894) | β = -0.31 | 6.0 x 10^-38 | eMERGE pooled analysis, discovery for mean platelet volume | 6291 individuals | |
| QRS duration | SCN5A (rs1805126) | β = -1.0 | 1.45 x 10^-8 | eMERGE pooled analysis, replication (EA) | 5272 individuals | Ritchie et al., 2013 |
| Red blood cell traits | G6PD (rs1050828) | β = -0.20 (SE = 0.03) | 4.0 x 10^-13 | eMERGE pooled analysis, discovery + replication for RBC count (AA) | 2315 individuals | Ding et al., 2013 |
| | G6PD (rs1050828) | β = 2.46 (SE = 0.32) | 1.0 x 10^-14 | eMERGE pooled analysis, discovery + replication for mean corpuscular volume (AA) | 2315 individuals | |
| | G6PD (rs1050828) | β = 0.72 (SE = 0.12) | 9.0 x 10^-9 | eMERGE pooled analysis, discovery + replication for mean corpuscular hemoglobin (AA) | 2315 individuals | |
| | ITFG3 (rs9924561) | β = -3.57 (SE = 0.32) | 5.0 x 10^-29 | eMERGE pooled analysis, discovery + replication for mean cell volume (AA) | 2315 individuals | |
| | ITFG3 (rs9924561) | β = -1.56 (SE = 0.12) | 8.0 x 10^-36 | eMERGE pooled analysis, discovery + replication for mean corpuscular hemoglobin (AA) | 2315 individuals | |
| | ITFG3 (rs9924561) | β = -0.47 (SE = 0.06) | 4.0 x 10^-13 | eMERGE pooled analysis, discovery + replication for mean corpuscular hemoglobin concentration (AA) | 2315 individuals | |
| | (rs7120391) | β = 0.30 (SE = 0.05) | 5.0 x 10^-9 | eMERGE pooled analysis, discovery + replication for mean corpuscular hemoglobin concentration (AA) | 2315 individuals | |
| Red blood cell traits | CDT1 (rs837763) | -0.06 | 2.0 x 10^-8 | eMERGE pooled analysis, discovery + replication for mean corpuscular hemoglobin concentration (EA) | 12,486 individuals | Ding et al., 2012 |
| | PTPLAD1/C15orf44 (rs8035639) | 0.13 | 8.0 x 10^-9 | eMERGE pooled analysis, discovery + replication for mean corpuscular hemoglobin (EA) | 12,486 individuals | |
| | THRB (rs9310736) | 0.35 | 6.0 x 10^-9 | eMERGE pooled analysis, discovery + replication for mean corpuscular volume (EA) | 12,486 individuals | |
| | (rs9937239) | 0.06 | 2.0 x 10^-8 | eMERGE pooled analysis, discovery + replication for mean corpuscular hemoglobin concentration (EA) | 12,486 individuals | |
| Type 2 diabetes | TCF7L2 (rs7903146) | OR = 1.41 | 2.98 x 10^-10 | eMERGE meta-analysis, replication (EA) | 2413 cases, 2392 controls | Kho et al., 2012 |
| White blood cell count | DARC (rs12075) | β = 1.28 (SE = 0.12) | 4.92 x 10^-24 | eMERGE joint analysis, discovery (AA) | 361 individuals | Crosslin et al., 2012 |
| White blood cell count | GSDMA (rs3859192) | β = 0.14 (SE = 0.02) | 1.75 x 10^-12 | eMERGE joint analysis, discovery (EA) | 13,562 individuals | Crosslin et al., 2012 |
| | MED24 (rs9916158) | β = -0.13 (SE = 0.02) | 4.92 x 10^-10 | eMERGE joint analysis, discovery (EA) | 13,562 individuals | |
| | PSMD3 (rs4065321) | β = 0.14 (SE = 0.02) | 3.47 x 10^-11 | eMERGE joint analysis, discovery (EA) | 13,562 individuals | |

The eMERGE network has conducted or contributed data toward genome-wide association studies. For each study with genome-wide significant results (p < 5 x 10^-8), we list the primary phenotype, the nearest genes associated, the index rs number, the reported genetic effect size, the p-value, the study design, the population, the sample size, and the reference. Blank phenotype and reference cells repeat the entry above. Abbreviations: AA, African American; EA, European American; β, beta; CI, confidence interval; OR, odds ratio; SE, standard error.



Despite potential differences in billing and coding practices across study sites, a total of 1317 cases and 5053 controls were identified with average weighted positive predictive values of 92.4% and 98.5%, respectively (Denny et al., 2011). The subsequent GWAS identified common genetic variants near FOXE1 associated with hypothyroidism in European American participants, and the findings were replicated in an independent dataset from the Mayo Genome Consortia as well as externally in the literature (Eriksson et al., 2012). These studies illustrate that existing genotype data linked to EMR data can be reused for other genomic discovery studies, a potentially cost-effective strategy. However, further study is needed to determine the extent of biases that were introduced in the generation of these data that may impact the widespread adoption of this strategy across a range of phenotypes available in the EMR.
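
For readers unfamiliar with the metric, a positive predictive value (PPV) is the fraction of algorithm-identified individuals confirmed on chart review, and a weighted average combines the per-site values. The sketch below assumes each site is weighted by the number of individuals its algorithm identified; that weighting scheme, the function names, and the counts in the usage note are assumptions for illustration, not details taken from Denny et al. (2011).

```python
def ppv(confirmed, reviewed):
    """Fraction of reviewed algorithm-positive charts confirmed by a reviewer."""
    return confirmed / reviewed


def weighted_ppv(sites):
    """Average per-site PPVs, weighting each site by how many people it identified.

    sites: list of (confirmed, reviewed, n_identified) tuples, one per site.
    """
    total_identified = sum(n for _, _, n in sites)
    return sum(ppv(c, r) * n for c, r, n in sites) / total_identified
```

With two hypothetical sites, `weighted_ppv([(45, 50, 100), (48, 50, 300)])` gives 0.945: the larger site's PPV of 0.96 dominates the smaller site's 0.90.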

As evident in the FOXE1/hypothyroidism example, existing genotype data linked to EMR data enable the relatively rapid identification of cases and controls for traditional GWAS where one disease or trait is studied. These data have also enabled the study of pleiotropy, whereby a genetic variant influences or impacts multiple phenotypes or traits (Stearns, 2010; Solovieff et al., 2013). In one popular approach, known as phenome-wide association studies or PheWAS, a GWAS-identified variant is interrogated for other associations throughout the available phenome. PheWAS has been performed in both epidemiologic (Pendergrass et al., 2013a) and EMR-based datasets such as eMERGE (Denny et al., 2010a, 2013). Collectively, these and other data (Sivakumaran et al., 2011) suggest that pleiotropy among GWAS-identified variants is not uncommon. PheWAS conducted in the EMR setting can reveal novel genotype-phenotype pleiotropic relationships not detectable in traditional epidemiologic cohorts. For example, a recent PheWAS in the eMERGE participants of European ancestry revealed a potential association between actinic keratosis and IRF4 rs12203592 (Denny et al., 2013) (Table 1), a GWAS-identified variant previously associated with hair color, eye color, and non-melanoma skin cancer (Han et al., 2008; Eriksson et al., 2010; Zhang et al., 2013).
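
The PheWAS idea of holding one variant fixed and scanning across phenotypes can be sketched as follows. This is a deliberately simplified illustration: it uses a carrier versus non-carrier 2x2 table with a Wald test and a Bonferroni correction, whereas published PheWAS analyses typically fit regression models with covariates. The function name and data layout are hypothetical.

```python
import math


def phewas(carrier, phenotypes, alpha=0.05):
    """Test one variant against many binary phenotypes.

    carrier: list of bool, one per participant (carries the variant or not).
    phenotypes: dict mapping a phenotype name to a list of bool (case or not).
    Returns (name, odds_ratio, p_value, significant) tuples sorted by p-value.
    """
    threshold = alpha / len(phenotypes)  # Bonferroni across tested phenotypes
    results = []
    for name, is_case in phenotypes.items():
        pairs = list(zip(carrier, is_case))
        a = sum(1 for g, y in pairs if g and y)          # carrier cases
        b = sum(1 for g, y in pairs if g and not y)      # carrier controls
        c = sum(1 for g, y in pairs if not g and y)      # non-carrier cases
        d = sum(1 for g, y in pairs if not g and not y)  # non-carrier controls
        # Add 0.5 to every cell (Haldane-Anscombe) so empty cells do not divide by zero.
        a, b, c, d = (x + 0.5 for x in (a, b, c, d))
        log_or = math.log((a * d) / (b * c))
        se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
        p_value = math.erfc(abs(log_or) / se / math.sqrt(2))  # two-sided Wald test
        results.append((name, math.exp(log_or), p_value, p_value < threshold))
    return sorted(results, key=lambda r: r[2])
```

Because the significance threshold shrinks with the number of phenotypes scanned, large EMR-linked samples are what make such a scan informative.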

Much like its contributions toward the study of pleiotropy, the eMERGE network is beginning to make substantial contributions to understudied or burgeoning areas of interest in genomic discovery such as the study of pediatric populations and diverse racial/ethnic groups. Indeed, with the addition of the pediatric study sites, eMERGE II boasts one of the largest collections of pediatric DNA samples linked to EMRs for genomic discovery (Gottesman et al., 2013). The current version (2) of the merged, imputed eMERGE II dataset includes >12,000 pediatric samples linked to EMRs. As of March 15, 2014, fewer than 5% of the GWAS annotated by the NHGRI GWAS Catalog (Welter et al., 2014) mention children as a study population, highlighting the tremendous opportunity for genomic discovery in this age group. To calibrate the eMERGE II datasets, a site-specific investigation was recently performed for body mass index (BMI) z-scores using BMI extracted from the pediatric EMRs and calculated using the Centers for Disease Control and Prevention (CDC) growth charts (Namjou et al., 2013). Similar to epidemiologic datasets (Frayling et al., 2007; Meyre et al., 2009; Scherag et al., 2010), this EMR-based study demonstrated that adult GWAS-identified obesity variants such as those in FTO were also relevant for children of European descent (Namjou et al., 2013). Genomic discovery using GWAS in pediatric populations is currently underway in eMERGE II for complex phenotypes such as autism and asthma.
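
BMI z-scores from growth charts are commonly computed with the LMS method, in which age- and sex-specific reference tables supply a skewness parameter L, a median M, and a coefficient of variation S. The sketch below shows that formula only; whether Namjou et al. (2013) used exactly this computation is an assumption, and the parameter values in the usage note are invented rather than taken from any CDC table.

```python
import math


def bmi_z_score(bmi, L, M, S):
    """LMS z-score: ((bmi / M) ** L - 1) / (L * S), or ln(bmi / M) / S when L is 0.

    L, M, S must be looked up for the child's age and sex in a reference table.
    """
    if abs(L) < 1e-12:
        return math.log(bmi / M) / S
    return ((bmi / M) ** L - 1.0) / (L * S)
```

A child whose BMI equals the reference median M has a z-score of 0 whatever L and S are; with the made-up parameters L = 1, M = 16, S = 0.1, a BMI of 17.6 gives a z-score of about 1.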

In the past several years, most GWAS have included individuals of European ancestry (Rosenberg et al., 2010). Indeed, only approximately 10% of the GWAS annotated in the NHGRI GWAS Catalog include populations of African ancestry (https://www.genome.gov/26525384). The eMERGE network is well poised to contribute to GWAS for populations of non-European ancestry given that several study sites (notably Northwestern University, Vanderbilt University, and The Icahn School of Medicine at Mount Sinai) include participants of African ancestry. eMERGE I has already contributed genome-wide associated variants (at a threshold of p < 10^-5) in participants of African ancestry to the NHGRI GWAS Catalog for LDL-C (Rasmussen-Torvik et al., 2012), red blood cell traits (Ding et al., 2013), white blood cell traits (Crosslin et al., 2012), type 2 diabetes (Kho et al., 2012), and electrocardiographic traits (Jeff et al., 2013). As an extension of GWAS, eMERGE investigators have also begun fine-mapping GWAS-identified regions to identify the best index variant in African ancestry populations as well as exploring alternative genomic discovery methods such as admixture mapping to identify potentially novel or population-specific associations (Jeff et al., 2014).

Beyond conventional GWAS, the eMERGE network has also led efforts to identify genetic (G x G) and environmental (G x E) modifiers of common, complex phenotypes. In an early example, eMERGE investigators used extrinsic biological knowledge via the Biofilter algorithm (Bush et al., 2009) to prioritize genetic variants for SNP-SNP modeling to identify gene-gene interactions relevant for HDL-C (Turner et al., 2011b). The extrinsic biological knowledge approach has also been recently implemented for both G x G and G x E tests of association for cataracts, with the latter including only environmental variables known to be associated with the eye disease (Pendergrass et al., 2013b,c). Finally, eMERGE investigators have implemented environment-wide association studies (EWAS), a relatively new approach to identify all possible environmental variables that may be relevant for G x E studies for the disease of interest, to identify and prioritize environmental factors important for type 2 diabetes (Hall et al., 2014).

eMERGE SECOND GENERATION GWAS 

The majority of GWAS described to date for the eMERGE network represent data and efforts from phase I of the network's existence. Phase II analyses of larger, more diverse samples are on-going (Gottesman et al., 2013). As documented and described in an accompanying article (Verma et al., in press), eMERGE II network datasets include single site datasets, a network-wide merged genotyped dataset, single site imputed datasets, and a network-wide merged imputed dataset; the merged set includes >36 million SNPs for samples from >50,000 individuals linked to EMRs. Imputation of the X-chromosome is underway, and future eMERGE II analyses will include this chromosome. Network-wide efforts are also underway to annotate copy number variants (Connolly et al., 2014) as well as to annotate and identify potentially deleterious null variants. Site-specific efforts are also underway to collect or extract additional standardized environmental data for G x E studies using the PhenX Toolkit (Hamilton et al., 2011; McCarty et al., 2014). Efforts are underway to develop analytical approaches for repeated measures data characteristic of the EMR, to conduct mapping studies for populations with three-way admixture events, and to incorporate phenotyping uncertainty when balancing sample size/power and misclassification (McDavid et al., 2013). With >36 million SNPs, large sample sizes, and phenotypically dense EMRs, eMERGE II and beyond promise to continue genomic discovery in the second generation of GWAS.